Downloads & Proxy Management
Downloading files (PDF, images, ZIPs, …)
client.scrape(url, browser=False) handles binary responses natively since v0.6.0 — the response carries content_type, body_base64, and a requests.Response-style API (is_binary, body, text, save()). Same code path as a normal page scrape; the response shape tells you what came back.
resp = client.scrape("https://investors.example.com/charter.pdf", browser=False)
if resp.is_binary:
    resp.save("charter.pdf")      # one-liner write
    # data = resp.body            # bytes (mirrors requests.Response.content)
    # ct = resp.content_type      # "application/pdf"
else:
    print(resp.content)           # text response (markdown / html)
Use browser=False for direct file downloads. browser=True costs 5 credits and adds no value when the target is the file itself; it only helps when the file sits behind JavaScript, a viewer, or SPA navigation. See the Gotchas section below for that case.
resp.text returns None for binary responses (forces an explicit is_binary branch instead of silently parsing base64 as text). resp.body always returns bytes regardless of MIME — text responses are UTF-8-encoded for you.
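The text / body contract described above can be modelled with a small stand-in class. This is a hypothetical illustration, not the SDK's implementation: the class name FakeResponse, its constructor, and the rule that a response is binary whenever body_base64 is set are all assumptions. Only the attribute names come from this page.

```python
import base64
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in mimicking the accessors described on this page."""
    content_type: Optional[str] = None
    content: Optional[str] = None       # text payload (markdown / html)
    body_base64: Optional[str] = None   # binary payload, base64-encoded

    @property
    def is_binary(self) -> bool:
        # Assumption: a response counts as binary when body_base64 is set.
        return self.body_base64 is not None

    @property
    def body(self) -> bytes:
        # Always bytes: decode base64 for binary, UTF-8-encode text.
        if self.is_binary:
            return base64.b64decode(self.body_base64)
        return (self.content or "").encode("utf-8")

    @property
    def text(self) -> Optional[str]:
        # None for binary, so callers have to branch on is_binary.
        return None if self.is_binary else self.content

    def save(self, path: str) -> None:
        # Write the raw bytes, whatever the MIME type.
        Path(path).write_bytes(self.body)
```

Returning None from text rather than a decoded string means a missed is_binary check tends to fail loudly (for example with an AttributeError on None) instead of quietly writing base64 into a text pipeline.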
With a proxy
Same kwarg as any scrape:
resp = client.scrape("https://example.com/file.pdf", browser=False, use_proxy="US")
resp.save("file.pdf")
Migrating from download()
client.download(url) is deprecated since v0.7.0 (still works, emits a DeprecationWarning; scheduled for removal in v1.0). The replacement is the same scrape(browser=False) call shown above — scrape returns binary content natively, so download no longer carries its weight.
# Before
import base64

result = client.download(url)
data = base64.b64decode(result.content)

# After
resp = client.scrape(url, browser=False)
resp.save("file.pdf")  # or: data = resp.body
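Because download() emits a standard DeprecationWarning, leftover call sites can be surfaced by promoting that warning to an error in a test run. The sketch below uses only the standard-library warnings module. legacy_download is a hypothetical stand-in for a deprecated call, not an SDK function, and uses_deprecated_api is likewise an illustrative helper name.

```python
import warnings


def legacy_download(url: str) -> str:
    """Hypothetical stand-in for a deprecated call such as client.download()."""
    warnings.warn("download() is deprecated; use scrape(browser=False)",
                  DeprecationWarning, stacklevel=2)
    return "ZmFrZQ=="  # placeholder base64 payload


def uses_deprecated_api(func, *args) -> bool:
    """Return True if calling func(*args) triggers a DeprecationWarning."""
    with warnings.catch_warnings():
        # Inside this block, DeprecationWarning is raised as an exception.
        warnings.simplefilter("error", DeprecationWarning)
        try:
            func(*args)
        except DeprecationWarning:
            return True
    return False
```

The same effect is available for a whole run with python -W error::DeprecationWarning, or with the equivalent -W option in pytest.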
Batch downloads (many files at once)
submit_batch + iter_results() streams binary responses just like any other batch. Every yielded ScrapeResponse carries the same is_binary / save / body surface — no separate API for downloads.
import logging

from scrapingpros import AsyncClient

log = logging.getLogger(__name__)

async def download_pdfs(urls, outdir):
    # urls maps a document ID to its URL; token is your API token, defined elsewhere.
    items = [
        {"url": u, "custom_id": doc_id, "browser": False}
        for doc_id, u in urls.items()
    ]
    async with AsyncClient(token) as client:
        batch = await client.submit_batch("pdfs-daily", items)
        async for r in batch.iter_results():
            if not r.guidance.success:
                log.warning("failed %s: %s", r.url, r.guidance.error_type)
                continue
            if r.is_binary:
                r.save(f"{outdir}/{r.custom_id}.pdf")
            else:
                # Server returned HTML instead of a file,
                # usually a redirect / 404 page / login wall.
                log.info("non-binary response from %s, skipping", r.url)
Memory stays constant — the streaming iterator never holds the full result list in RAM, so this scales to tens of thousands of files. Disk writes happen one at a time as each download completes.
If you need a list-return shape for simpler call sites:
results = client.batch_scrape([
    {"url": u, "custom_id": doc_id, "browser": False}
    for doc_id, u in urls.items()
])
for r in results:
    if r.guidance.success and r.is_binary:
        r.save(f"{outdir}/{r.custom_id}.pdf")
Crash-resilient batch downloads (since v0.7.5)
For pipelines that crash and restart, persist batch.last_completed_at alongside (collection_id, run_id) and resume from the cursor:
# Submit + persist as you go (db is a placeholder for your own storage layer)
batch = await client.submit_batch("pdfs-daily", items)
db.save(cid=batch.collection_id, rid=batch.run_id,
        submitted=batch.submitted_count)
async for r in batch.iter_results():
    if r.is_binary:
        r.save(f"{outdir}/{r.custom_id}.pdf")
    # Advance the cursor after every result, binary or not.
    db.update(cid=batch.collection_id, cursor=batch.last_completed_at)

# After a restart, resume strictly after the saved cursor:
row = db.find(cid)
async for r in client.iter_results(row.cid, row.rid,
                                   since=row.cursor,
                                   submitted_count=row.submitted):
    if r.is_binary:
        r.save(f"{outdir}/{r.custom_id}.pdf")
See Batch API → Cross-process resume for the full pattern (including the same-millisecond ties caveat).
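The db object in the snippet above is left unspecified. One possible shape, assuming SQLite is acceptable for checkpoint storage, is sketched below. The CheckpointStore class, its table layout, and storing the cursor as text are assumptions made for illustration; only the save / update / find call shapes are taken from the example.

```python
import sqlite3
from collections import namedtuple

Row = namedtuple("Row", "cid rid submitted cursor")


class CheckpointStore:
    """Hypothetical SQLite-backed store for batch resume state."""

    def __init__(self, path: str = "checkpoints.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS batches ("
            "cid TEXT PRIMARY KEY, rid TEXT, "
            "submitted INTEGER, resume_cursor TEXT)"
        )

    def save(self, cid, rid, submitted):
        # Record a freshly submitted batch; the cursor starts out empty.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO batches VALUES (?, ?, ?, NULL)",
                (cid, rid, submitted),
            )

    def update(self, cid, cursor):
        # Commit on every update so the cursor survives a crash.
        with self.conn:
            self.conn.execute(
                "UPDATE batches SET resume_cursor = ? WHERE cid = ?",
                (cursor, cid),
            )

    def find(self, cid):
        # Return the saved row as a namedtuple, or None if unknown.
        row = self.conn.execute(
            "SELECT cid, rid, submitted, resume_cursor "
            "FROM batches WHERE cid = ?",
            (cid,),
        ).fetchone()
        return Row(*row) if row else None
```

If last_completed_at is not already a string in your SDK version, serialize it (for example with isoformat()) before passing it to update.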
Large files (over the inline cap)
The jobs listing embeds bodies inline up to ~256 KB (server-side cap). Larger files come back via the per-job result endpoint instead. The SDK handles this automatically inside _build_result, so callers see the same ScrapeResponse shape with body_base64 / is_binary / save() populated, and no special handling is needed on your side. The only difference is one extra round-trip to fetch the body, which costs wire bandwidth but no credits.
Gotchas
JS-rendered PDF viewers
If the URL is a viewer wrapper (https://site.com/viewer.html?file=...), scrape(url, browser=False) returns the HTML of the viewer page, not the PDF. You need browser=True to render the wrapper, then extract the real PDF URL from the DOM, then run a second scrape with browser=False to download it:
viewer = client.scrape(viewer_url, browser=True, actions=[
    WaitForSelectorAction(selector="css:iframe[src*='.pdf']", time=8000),
])
# extract_pdf_url is your own helper: pull the iframe src out of the rendered HTML.
real_pdf_url = extract_pdf_url(viewer.html)
pdf = client.scrape(real_pdf_url, browser=False)
pdf.save("doc.pdf")
This costs 5 credits for the render + 1 for the download = 6 total per file. Use it only when the direct path doesn't work.
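extract_pdf_url in the snippet above is not defined on this page. A minimal sketch of such a helper might look like the following, assuming the viewer embeds the file in an iframe, embed, or object tag whose URL contains ".pdf" (as the selector above suggests). The function is hypothetical and uses only the standard library; the optional base_url parameter is an addition for resolving relative links.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _PdfSrcFinder(HTMLParser):
    """Records the first iframe/embed/object URL that points at a PDF."""

    def __init__(self):
        super().__init__()
        self.found: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.found or tag not in ("iframe", "embed", "object"):
            return
        attr_map = dict(attrs)
        # <object> carries its URL in data=, the other two in src=.
        src = attr_map.get("src") or attr_map.get("data")
        if src and ".pdf" in src.lower():
            self.found = src


def extract_pdf_url(html: str, base_url: str = "") -> Optional[str]:
    """Return the embedded PDF URL, or None if the page has no such embed."""
    finder = _PdfSrcFinder()
    finder.feed(html)
    if finder.found is None:
        return None
    # Resolve relative links against the viewer URL when one is given.
    return urljoin(base_url, finder.found) if base_url else finder.found
```

Viewers that build the PDF URL in JavaScript, or pass it only through the ?file= query parameter, need a different extraction rule; treat a None result as a signal to inspect the page by hand.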
Files behind auth / login walls
The SDK does not maintain a session between scrapes. If the PDF is behind a login form, do the login as a separate MethodPOST(content_type="form") scrape, extract the resulting auth cookie / token from network_capture, and pass it on the file download — or model the whole flow as a single browser=True scrape with actions=[InputAction(...), ClickAction(...), WaitForSelectorAction(...)]. There's no one-size-fits-all recipe here; it depends on the site.
Verifying you got the file you expected
r.content_type carries the MIME the server returned ("application/pdf", "image/png", "application/zip", …). Branch on it when you process a mixed list of URLs and want to route by file type:
async for r in batch.iter_results():
    if not r.guidance.success or not r.is_binary:
        continue
    # Drop parameters such as "; charset=..." before comparing.
    mime = (r.content_type or "").split(";")[0].strip().lower()
    if mime == "application/pdf":
        r.save(f"pdfs/{r.custom_id}.pdf")
    elif mime.startswith("image/"):
        ext = mime.split("/")[1]
        r.save(f"images/{r.custom_id}.{ext}")
    else:
        r.save(f"misc/{r.custom_id}.bin")
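The routing above hard-codes an extension per branch and trusts the server's header. A sketch of a more general approach follows, using the standard-library mimetypes module for the extension and a leading-bytes check as a sanity test. normalize_mime, extension_for, and looks_like are hypothetical helper names; the byte prefixes are the usual PDF, PNG, and (non-empty) ZIP file signatures.

```python
import mimetypes
from typing import Optional

# Leading bytes of a few common formats.
MAGIC = {
    "application/pdf": b"%PDF-",
    "image/png": b"\x89PNG\r\n\x1a\n",
    "application/zip": b"PK\x03\x04",  # non-empty archives
}


def normalize_mime(content_type: Optional[str]) -> str:
    """Strip parameters such as '; charset=binary' and lowercase the type."""
    return (content_type or "").split(";")[0].strip().lower()


def extension_for(content_type: Optional[str], default: str = ".bin") -> str:
    """Map a Content-Type value to a file extension, with a fallback."""
    return mimetypes.guess_extension(normalize_mime(content_type)) or default


def looks_like(content_type: Optional[str], body: bytes) -> bool:
    """True if body starts with the signature expected for its MIME type.

    Types without a known signature are accepted as-is.
    """
    magic = MAGIC.get(normalize_mime(content_type))
    return magic is None or body.startswith(magic)
```

In a batch loop this might be used as: if looks_like(r.content_type, r.body), save under r.custom_id plus extension_for(r.content_type); otherwise log the mismatch, since a body that fails the check is often an HTML error page served with a misleading header.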
Proxy management
List available countries
resp = client.list_proxy_countries()
print(resp.countries) # ["AD", "AE", "AF", ...] — 200+ countries
Request a country proxy
Country-specific proxies require approval:
resp = client.request_proxy_country("US", reason="Need US pricing data")
print(resp.status) # "pending" or "already_approved"
Check approval status
status = client.proxy_status()
print(status.approved_countries) # ["US", "BR"]
print(status.pending_countries) # ["DE"]
Use a country proxy
result = client.scrape("https://example.com", use_proxy="US")
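The calls above compose into a simple guard: use the country proxy if it is already approved, otherwise file a request and carry on without it. The sketch below relies only on the methods and attributes shown on this page (proxy_status, approved_countries, pending_countries, request_proxy_country). proxy_or_none is a hypothetical helper, and what to do while approval is pending is left to the caller.

```python
from typing import Optional


def proxy_or_none(client, country: str, reason: str) -> Optional[str]:
    """Return country if it is approved; otherwise request it and return None.

    A country that is already pending is not requested a second time.
    """
    status = client.proxy_status()
    if country in status.approved_countries:
        return country
    if country not in status.pending_countries:
        # Not approved and not yet requested: file the request now.
        client.request_proxy_country(country, reason=reason)
    return None
```

A call site could then build its keyword arguments conditionally, for example kwargs = {"use_proxy": code} if code else {}, and pass **kwargs to client.scrape, so the use_proxy argument is simply omitted until approval comes through.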
Plans and billing
# View all plans
plans = client.plans()
# Check current month billing
billing = client.billing()
print(billing.month)
# Usage metrics
metrics = client.client_metrics(date="2026-04")
# API health
health = client.health()